
Conversation

**lbliii** (Contributor) commented Jan 2, 2026

initial pass at creating SDG docs

Note: I'll be out next week, but feel free to leave any comments and I'll get to them ASAP.

lbliii added 3 commits January 2, 2026 10:25
Signed-off-by: Lawrence Lane <[email protected]>
**Copilot AI** left a comment

Pull request overview

This PR adds comprehensive documentation for the new Ray-based Synthetic Data Generation (SDG) capabilities in NeMo Curator. The documentation covers both simple multilingual Q&A generation and advanced NemotronCC pipelines for text transformation and knowledge extraction.

Key Changes

  • Added tutorial README with quick start examples and command-line reference for all SDG scripts
  • Created comprehensive documentation structure covering LLM client configuration, multilingual Q&A tutorials, and NemotronCC pipeline workflows
  • Updated release notes to reflect SDG feature availability and removed the previous limitation note about SDG being under refactoring

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `tutorials/synthetic/README.md` | Enhanced tutorial README with detailed usage examples, a command-line arguments table, and links to documentation |
| `docs/index.md` | Added Synthetic Data section to the main documentation table of contents |
| `docs/curate-text/synthetic/index.md` | Created overview page explaining SDG architecture, use cases, and available stages with a mermaid diagram |
| `docs/curate-text/synthetic/llm-client.md` | Added comprehensive LLM client configuration guide covering NVIDIA API, vLLM, and TGI endpoints with performance tuning |
| `docs/curate-text/synthetic/multilingual-qa.md` | Created step-by-step tutorial for generating multilingual Q&A pairs with code examples and a CLI reference |
| `docs/curate-text/synthetic/nemotron-cc/index.md` | Documented NemotronCC pipeline architecture with composable pattern explanation and task configuration |
| `docs/curate-text/synthetic/nemotron-cc/tasks.md` | Created detailed reference for all five NemotronCC tasks with prompt templates and post-processing logic |
| `docs/curate-text/index.md` | Added Synthetic Data Generation card to the text curation index page |
| `docs/about/release-notes/index.md` | Added SDG feature announcement and removed the previous limitation note |

- NVIDIA API
- Base URL for the API endpoint
* - `--model-name`
- llama-3.3-70b
**Copilot AI** commented on Jan 2, 2026:

The default value "llama-3.3-70b" doesn't match the actual default used in the example script (synthetic_data_generation_example.py), which is "meta/llama-3.3-70b-instruct". Update this to match the actual implementation for consistency.

Suggested change:
`- llama-3.3-70b` → `- meta/llama-3.3-70b-instruct`

Comment on lines +78 to +79
## Command-Line Arguments

**Copilot AI** commented on Jan 2, 2026:

The section header "Command-Line Arguments" discusses arguments across different scripts, but the title suggests these are universal. Consider adding clarifying text that differentiates between common arguments (used by multiple scripts) and script-specific arguments, or rename to "Command-Line Reference" for better clarity.

Suggested change:
`## Command-Line Arguments` → `## Command-Line Reference`, followed by the note: "The arguments below are grouped into options shared across multiple example scripts and options specific to particular NemotronCC pipelines. Not every argument applies to every tutorial; refer to each script's `--help` output for the complete, authoritative list."

**greptile-apps** bot commented Jan 2, 2026

Greptile Summary

This PR adds comprehensive documentation for synthetic data generation (SDG) capabilities in NeMo Curator. The documentation includes a well-structured overview, LLM client configuration guide, multilingual Q&A tutorial, and detailed NemotronCC pipeline documentation with clear architecture diagrams and task references.

Major additions:

  • SDG overview page with architecture diagram, use cases, and stage comparison
  • LLM client configuration guide with performance tuning and troubleshooting
  • Multilingual Q&A generation tutorial with step-by-step examples
  • NemotronCC documentation covering 5 specialized tasks (WikiPara, DiverseQA, Distill, ExtractKnowledge, KnowledgeList)
  • Enhanced tutorial README with comprehensive CLI examples
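The LLM client behaviors listed above (limiting concurrent requests, retrying with exponential backoff) can be sketched in plain asyncio. The class and function names below are hypothetical illustrations of the pattern, not NeMo Curator's actual client API:

```python
import asyncio
import random


class ThrottledClient:
    """Toy sketch: cap concurrency and retry transient failures with backoff."""

    def __init__(self, max_concurrent_requests: int = 4, max_retries: int = 3):
        # Semaphore enforces the concurrency cap across all in-flight calls.
        self._sem = asyncio.Semaphore(max_concurrent_requests)
        self._max_retries = max_retries

    async def generate(self, prompt: str, call_llm) -> str:
        async with self._sem:
            for attempt in range(self._max_retries + 1):
                try:
                    return await call_llm(prompt)
                except ConnectionError:
                    if attempt == self._max_retries:
                        raise
                    # Exponential backoff with a little jitter (kept tiny here).
                    await asyncio.sleep(0.01 * (2 ** attempt) + random.random() * 0.01)


async def main() -> list[str]:
    failures = {"count": 0}

    async def flaky_llm(prompt: str) -> str:
        # Fail once, then succeed, to exercise the retry path.
        if failures["count"] < 1:
            failures["count"] += 1
            raise ConnectionError("transient")
        return f"answer to: {prompt}"

    client = ThrottledClient(max_concurrent_requests=2)
    return await asyncio.gather(*(client.generate(p, flaky_llm) for p in ["q1", "q2"]))
```

A real client would catch provider-specific rate-limit errors rather than `ConnectionError`, but the control flow is the same.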

Critical issue:

  • Release notes (docs/about/release-notes/index.md) were drastically reduced, removing comprehensive v26.02 information about Docker, PyPI, video/audio modalities, and architecture changes. The SDG content should be added as a new section rather than replacing existing release information.

Confidence Score: 3/5

  • This PR is generally safe but requires fixing the release notes before merging to avoid losing critical release information
  • The documentation is well-written and comprehensive, but the release notes issue is a critical problem that removes important information users need. The SDG docs themselves are high quality with no logical issues.
  • Pay close attention to docs/about/release-notes/index.md which needs the original v26.02 release content restored

Important Files Changed

| Filename | Overview |
| --- | --- |
| `docs/about/release-notes/index.md` | Release notes drastically reduced from 231 lines to 44 lines, removing comprehensive v26.02 release information about Docker, PyPI, video/audio modalities, deduplication, and architecture changes |
| `docs/curate-text/synthetic/index.md` | New comprehensive overview page for synthetic data generation with clear architecture diagrams, use cases, and a stage comparison table |
| `docs/curate-text/synthetic/llm-client.md` | Well-structured LLM client configuration guide with comprehensive parameter documentation, examples, and troubleshooting tips |
| `docs/curate-text/synthetic/multilingual-qa.md` | Step-by-step tutorial for multilingual Q&A generation with clear code examples, CLI usage, and sample output |
| `docs/curate-text/synthetic/nemotron-cc/index.md` | Comprehensive NemotronCC pipeline documentation with clear architecture diagrams, a task comparison table, and quality-based processing strategy |
| `docs/curate-text/synthetic/nemotron-cc/tasks.md` | Detailed task reference with prompt templates, configuration examples, and post-processing details for each NemotronCC stage |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Pipeline
    participant LLMClient
    participant NVAPI as NVIDIA API/vLLM
    participant Output

    User->>Pipeline: Create SDG Pipeline
    User->>Pipeline: Add QAMultilingualSyntheticStage or NemotronCC Stage
    User->>Pipeline: Configure AsyncOpenAIClient

    User->>Pipeline: pipeline.run()
    Pipeline->>LLMClient: Initialize client with rate limiting

    alt Multilingual Q&A
        Pipeline->>LLMClient: Generate Q&A pairs in languages
        LLMClient->>NVAPI: Async API calls (max_concurrent_requests)
        NVAPI-->>LLMClient: Generated Q&A responses
        LLMClient->>Pipeline: Return DocumentBatch
        Pipeline->>Pipeline: Apply language filters (optional)
    else NemotronCC Pipeline
        Pipeline->>Pipeline: Preprocessing (tokenize, segment, filter)
        Pipeline->>LLMClient: Transform documents via LLM
        LLMClient->>NVAPI: Batch API calls with retry logic
        NVAPI-->>LLMClient: Transformed text (paraphrased/QA/distilled)
        LLMClient->>Pipeline: Return transformed data
        Pipeline->>Pipeline: Postprocessing (cleanup, quality filter)
    end

    Pipeline->>Output: Write to JSONL/Parquet
    Output-->>User: Generated synthetic data
```
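The diagram's flow reduces to a small stage-pipeline sketch. Everything here, including `DocumentBatch`, the stage functions, and the `Pipeline` class, is a toy illustration echoing the diagram's vocabulary, not NeMo Curator's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentBatch:
    """Toy stand-in for the batch type passed between stages."""
    documents: list = field(default_factory=list)


class Pipeline:
    """Toy pipeline: stages run in order, each mapping a batch to a new batch."""

    def __init__(self):
        self._stages = []

    def add_stage(self, stage):
        self._stages.append(stage)
        return self

    def run(self, batch: DocumentBatch) -> DocumentBatch:
        for stage in self._stages:
            batch = stage(batch)
        return batch


def qa_stage(batch: DocumentBatch) -> DocumentBatch:
    # Stand-in for an LLM-backed Q&A generation stage.
    docs = [{**d, "qa": f"Q/A for: {d['text']}"} for d in batch.documents]
    return DocumentBatch(docs)


def language_filter(batch: DocumentBatch) -> DocumentBatch:
    # Stand-in for the optional language-filter step in the diagram.
    return DocumentBatch([d for d in batch.documents if d.get("lang") == "en"])
```

The real stages call out to an LLM client and carry resource specifications, but the compose-then-run shape is the same.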

**greptile-apps** bot left a comment

Additional Comments (1)

  1. docs/about/release-notes/index.md, line 21 (link)

    style: The "What's Next" section placeholder needs completion before release

11 files reviewed, 1 comment


**huvunvidia** (Contributor) left a comment

I did a very quick pass, mainly to identify features/content that were generated by the AI agent without being manually tested.

Signed-off-by: Lawrence Lane <[email protected]>
**greptile-apps** bot commented Jan 2, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

**greptile-apps** bot left a comment

1 file reviewed, 1 comment


2. **NeMo Curator with text extras**

```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]
```

The installation command is inconsistent with the standard format used elsewhere in the documentation. The --extra-index-url https://pypi.nvidia.com flag is not needed with uv pip install, and the package name should be quoted.

The command should match the format used in other quickstart guides (e.g., docs/get-started/text.md line 52):

Suggested change:
`uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[text_cuda12]` → `uv pip install "nemo-curator[text_cuda12]"`

This ensures consistency with the rest of the documentation and follows the recommended installation pattern.


**greptile-apps** bot left a comment

1 file reviewed, 1 comment



- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)
- [NemotronCC Pipeline Documentation](../../docs/curate-text/synthetic/nemotron-cc/index.md)
- [Task Reference](../../docs/curate-text/synthetic/nemotron-cc/tasks.md)

Missing newline at end of file. Add a trailing newline for POSIX compliance.


Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: Lawrence Lane <[email protected]>
**greptile-apps** bot left a comment

2 files reviewed, 2 comments


Comment on lines 13 to +35
# NeMo Curator Release Notes: {{ current_release }}

This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](../../curate-video/index.md) and [audio](../../curate-audio/index.md) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
## Synthetic Data Generation

**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions.
New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:

## Installation Updates
- **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff
- **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts
- **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows:
- **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose
- **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training
- **Distill**: Create condensed, information-dense paraphrases preserving key concepts
- **Extract Knowledge**: Extract factual content as textbook-style passages
- **Knowledge List**: Extract structured fact lists from documents

- **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`)
- **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support
- **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management
- **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality:
Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).

```{list-table} Available Installation Extras
:header-rows: 1
:widths: 25 35 40
* - Extra
- Installation Command
- Description
* - **All Modalities**
- `nemo-curator[all]`
- Complete installation with all modalities and GPU support
* - **Text Curation**
- `nemo-curator[text_cuda12]`
- GPU-accelerated text processing with RAPIDS
* - **Image Curation**
- `nemo-curator[image_cuda12]`
- Image processing with NVIDIA DALI
* - **Audio Curation**
- `nemo-curator[audio_cuda12]`
- Speech recognition with NeMo ASR models
* - **Video Curation**
- `nemo-curator[video_cuda12]`
- Video processing with GPU acceleration
* - **Basic GPU**
- `nemo-curator[cuda12]`
- CUDA utilities without modality-specific dependencies
```

All GPU installations require the NVIDIA PyPI index:
```bash
uv pip install --extra-index-url https://pypi.nvidia.com nemo-curator[EXTRA]
```

## New Modalities

### Video

NeMo Curator now supports comprehensive [video data curation](../../curate-video/index.md) with distributed processing capabilities:

- **Video splitting**: [Fixed-stride](../../curate-video/process-data/clipping.md) and [scene-change detection (TransNetV2)](../../curate-video/process-data/clipping.md) for clip extraction
- **Semantic deduplication**: [K-means clustering and pairwise similarity](../../curate-video/process-data/dedup.md) for near-duplicate clip removal
- **Content filtering**: [Motion-based filtering](../../curate-video/process-data/filtering.md) and [aesthetic filtering](../../curate-video/process-data/filtering.md) for quality improvement
- **Embedding generation**: InternVideo2 and Cosmos-Embed1 models for clip-level embeddings
- **Enhanced captioning**: [VL-based caption generation with optional LLM-based rewriting](../../curate-video/process-data/captions-preview.md) (Qwen-VL and Qwen-LM supported) for detailed video descriptions
- **Ray-based distributed architecture**: Scalable video processing with [autoscaling support](../concepts/video/architecture.md)

### Audio

New [audio curation capabilities](../../curate-audio/index.md) for speech data processing:

- **ASR inference**: [Automatic speech recognition](../../curate-audio/process-data/asr-inference/index.md) using NeMo Framework pretrained models
- **Quality assessment**: [Word Error Rate (WER) and Character Error Rate (CER)](../../curate-audio/process-data/quality-assessment/index.md) calculation
- **Speech metrics**: [Duration analysis and speech rate metrics](../../curate-audio/process-data/audio-analysis/index.md) (words/characters per second)
- **Text integration**: Seamless integration with [text curation workflows](../../curate-audio/process-data/text-integration/index.md) via `AudioToDocumentStage`
- **Manifest support**: JSONL manifest format for audio file management

## Modality Refactors

### Text

- **Ray backend migration**: Complete transition from Dask to Ray for distributed [text processing](../../curate-text/index.md)
- **Improved model-based classifier throughput**: Better overlapping of compute between tokenization and inference through [length-based sequence sorting](../../curate-text/process-data/quality-assessment/distributed-classifier.md) for optimal GPU memory utilization
- **Task-centric architecture**: New `Task`-based processing model for finer-grained control
- **Pipeline redesign**: Updated `ProcessingStage` and `Pipeline` architecture with resource specification

### Image

- **Pipeline-based architecture**: Transitioned from legacy `ImageTextPairDataset` to modern [stage-based processing](../../curate-images/index.md) with `ImageReaderStage`, `ImageEmbeddingStage`, and filter stages
- **DALI-based image loading**: New `ImageReaderStage` uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
- **Modular processing stages**: Separate stages for [embedding generation](../../curate-images/process-data/embeddings/index.md), [aesthetic filtering](../../curate-images/process-data/filters/aesthetic.md), and [NSFW filtering](../../curate-images/process-data/filters/nsfw.md)
- **Task-based data flow**: Images processed as `ImageBatch` tasks containing `ImageObject` instances with metadata, embeddings, and classification scores

Learn more about [image curation](../../curate-images/index.md).

## Deduplication Improvements

Enhanced deduplication capabilities across all modalities with improved performance and flexibility:

- **Exact and Fuzzy deduplication**: Updated [rapidsmpf-based shuffle backend](../../reference/infrastructure/gpu-processing.md) for more efficient GPU-to-GPU data transfer and better spilling capabilities
- **Semantic deduplication**: Support for deduplicating [text](../../curate-text/process-data/deduplication/semdedup.md) and [video](../../curate-video/process-data/dedup.md) datasets using unified embedding-based workflows
- **New ranking strategies**: Added `RankingStrategy` which allows you to rank elements within cluster centers to decide which point to prioritize during duplicate removal, supporting [metadata-based ranking](../../curate-text/process-data/deduplication/semdedup.md) to prioritize specific datasets or inputs

## Core Refactors

The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:

```{mermaid}
graph LR
subgraph "User Layer"
P[Pipeline]
S1[ProcessingStage X→Y]
S2[ProcessingStage Y→Z]
S3[ProcessingStage Z→W]
R[Resources<br/>CPU/GPU/NVDEC/NVENC]
end
subgraph "Orchestration Layer"
BE[BaseExecutor Interface]
end
subgraph "Backend Layer"
XE[XennaExecutor<br/>Production Ready]
RAP[RayActorPoolExecutor<br/>Experimental]
RDE[RayDataExecutor<br/>Experimental]
end
subgraph "Adaptation Layer"
XA[Xenna Adapter]
RAPA[Ray Actor Adapter]
RDA[Ray Data Adapter]
end
subgraph "Execution Layer"
X[Cosmos-Xenna<br/>Streaming/Batch]
RAY1[Ray Actor Pool<br/>Load Balancing]
RAY2[Ray Data API<br/>Dataset Processing]
end
P --> S1
P --> S2
P --> S3
S1 -.-> R
S2 -.-> R
S3 -.-> R
P --> BE
BE --> XE
BE --> RAP
BE --> RDE
XE --> XA
RAP --> RAPA
RDE --> RDA
XA --> X
RAPA --> RAY1
RDA --> RAY2
style XE fill:#90EE90
style RAP fill:#FFE4B5
style RDE fill:#FFE4B5
style P fill:#E6F3FF
style BE fill:#F0F8FF
```

### Pipelines

- **New Pipeline API**: Ray-based pipeline execution with `BaseExecutor` interface
- **Multiple backends**: Support for [Xenna, Ray Actor Pool, and Ray Data execution backends](../../reference/infrastructure/execution-backends.md)
- **Resource specification**: Configurable CPU and GPU memory requirements per stage
- **Stage composition**: Improved stage validation and execution orchestration

### Stages

- **ProcessingStage redesign**: Generic `ProcessingStage[X, Y]` base class with type safety
- **Resource requirements**: Built-in resource specification for CPU and GPU memory
- **Backend adapters**: Stage adaptation layer for different Ray orchestration systems
- **Input/output validation**: Enhanced type checking and data validation

## Tutorials

- **Text tutorials**: Updated all [text curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/text) to use new Ray-based API
- **Image tutorials**: Migrated [image processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/image) to unified backend
- **Audio tutorials**: New [audio curation tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/audio)
- **Video tutorials**: New [video processing tutorials](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials/video)

For all tutorial content, refer to the [tutorials directory](https://github.com/NVIDIA-NeMo/Curator/tree/main/tutorials) in the NeMo Curator GitHub repository.

## Known Limitations

> (Pending Refactor in Future Release)
### Generation

- **Synthetic data generation**: Synthetic text generation features are being refactored for Ray compatibility
- **Hard negative mining**: Retrieval-based data generation workflows under development

### PII

- **PII processing**: Personal Identifiable Information removal tools are being updated for Ray backend
- **Privacy workflows**: Enhanced privacy-preserving data curation capabilities in development

### Blending & Shuffling

- **Data blending**: Multi-source dataset blending functionality being refactored
- **Dataset shuffling**: Large-scale data shuffling operations under development

## Docs Refactor

- **Local preview capability**: Improved documentation build system with local preview support
- **Modality-specific guides**: Comprehensive documentation for each supported modality ([text](../../curate-text/index.md), [image](../../curate-images/index.md), [audio](../../curate-audio/index.md), [video](../../curate-video/index.md))
- **API reference**: Complete [API documentation](../../apidocs/index.rst) with type annotations and examples

---

## What's Next

The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.
The next release will focus on ...

The release notes have been drastically reduced from 231 lines to just 44 lines, removing all the comprehensive release information about the v26.02 release. The original content included detailed information about:

  • Installation updates (Docker, UV, PyPI improvements)
  • New modalities (Video and Audio support)
  • Modality refactors (Text and Image improvements)
  • Deduplication improvements
  • Core architecture refactors (Pipeline and Stage redesigns)
  • Tutorials updates
  • Known limitations

This content should not be completely replaced. Instead, the SDG documentation should be added to the existing release notes as a new section, not used as a replacement for all the other release information.

The release notes serve as a critical reference for users understanding what changed in the release. Removing this information will confuse users and make it difficult for them to understand the full scope of the v26.02 release.

| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` |
---

## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)

Missing newline after the "Additional Resources" heading. The markdown formatting is incorrect - there should be a blank line between the heading and the list items.

Suggested change:
`## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` →
`## Additional Resources` followed by `- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` on its own line.

**greptile-apps** bot left a comment

1 file reviewed, 1 comment


| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` |
---

## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)

Missing newline between heading and bullet list. Add a blank line after "## Additional Resources" for proper markdown rendering.

Suggested change:
`## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` →
`## Additional Resources` followed by `- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` on its own line.

**greptile-apps** bot left a comment

Additional Comments (1)

  1. tutorials/synthetic/README.md, line 103 (link)

    style: Missing trailing newline at end of file.


12 files reviewed, 3 comments


Comment on lines +15 to +28
## Synthetic Data Generation

**Migrating from a previous version of NeMo Curator?** Refer to the {ref}`Migration Guide <migration-guide>` for step-by-step instructions and the {ref}`Migration FAQ <migration-faq>` for common questions.
New Ray-based synthetic data generation capabilities for creating and augmenting training data using LLMs:

## Installation Updates
- **LLM Client Infrastructure**: OpenAI-compatible async/sync clients with automatic rate limiting, retry logic, and exponential backoff
- **Multilingual Q&A Generation**: Generate synthetic Q&A pairs across multiple languages using customizable prompts
- **NemotronCC Pipelines**: Advanced text transformation and knowledge extraction workflows:
- **Wikipedia Paraphrasing**: Improve low-quality text by rewriting in Wikipedia-style prose
- **Diverse QA**: Generate diverse question-answer pairs for reading comprehension training
- **Distill**: Create condensed, information-dense paraphrases preserving key concepts
- **Extract Knowledge**: Extract factual content as textbook-style passages
- **Knowledge List**: Extract structured fact lists from documents

- **New Docker container**: Updated Docker infrastructure with CUDA 12.8.1 and Ubuntu 24.04 base; obtainable through the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) (`nvcr.io/nvidia/nemo-curator:{{ container_version }}`)
- **Docker file to build own image**: Simplified [Dockerfile](https://github.com/NVIDIA-NeMo/Curator/blob/main/docker/Dockerfile) structure for custom container builds with FFmpeg support
- **UV source installations**: Integrated UV package manager (v0.8.22) for faster dependency management
- **PyPI improvements**: Enhanced PyPI installation with modular extras for targeted functionality:
Learn more in the [Synthetic Data Generation documentation](../../curate-text/synthetic/index.md).

logic: The release notes have been reduced from 231 lines to 44 lines, removing all comprehensive v26.02 release information including Docker updates, PyPI improvements, video/audio modalities, deduplication improvements, and architecture refactors. The SDG documentation should be added to existing release notes, not replace them entirely. Users need the full scope of v26.02 changes for understanding what's new in the release.

| **[Multilingual Q&A Generation](synthetic_data_generation_example.py)** | Generate synthetic Q&A pairs in multiple languages using LLMs | `synthetic_data_generation_example.py` |
---

## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)

style: Missing blank line between heading and list.

Suggested change:
`## Additional Resources- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` →
`## Additional Resources` followed by `- [LLM Client Configuration](../../docs/curate-text/synthetic/llm-client.md)` on its own line.


@ayushdg ayushdg requested a review from satishra-ai January 13, 2026 21:00
**satishra-ai** commented:
The doc doesn't mention how to generate the input data required by the CLI below:

Process Parquet input files:

python nemotron_cc/nemotron_cc_sdg_high_quality_example_pipeline.py \
    --task diverse_qa \
    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
    --input-parquet-path ./my_data/*.parquet \
    --output-path ./synthetic_output \
    --output-format parquet
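On that point, the expected input schema isn't stated in this thread, so the sketch below assumes one record per document with a `text` field, and writes JSONL (a format the pipelines also write, per the sequence diagram); the helper name is hypothetical, and producing Parquet instead would typically go through pandas/pyarrow (e.g. `pd.DataFrame(records).to_parquet(...)`):

```python
import json
from pathlib import Path


def write_jsonl(records: list, path: str) -> int:
    """Hypothetical helper: write one JSON object per line and return the count.

    Assumes each record is a dict with at least a "text" field; the actual
    schema required by the example pipeline should be confirmed in the docs.
    """
    out = Path(path)
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

Documenting a small preparation step like this in the tutorial README would address the reviewer's gap.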
